Taming Web Sources with "Minute-Made" Wrappers

Authors

  • Fabien Azavant
  • Arnaud Sahuguet
Abstract

The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases and this proportion keeps increasing. In some cases, database access is only granted through a Web gateway using forms as a query language and HTML as a display vehicle. In order to permit inter-operation (between Web sources and legacy databases or among Web sources themselves) there is a strong need for Web wrappers. Web wrappers share some of the characteristics of standard database wrappers but usually the underlying data sources offer very limited query capabilities and the structure of the result (due to HTML shortcomings) might be loose and unstable. To overcome these problems, we divide the architecture of our Web wrappers into three components: (1) fetching the document, (2) extracting the information from its HTML formatting, and (3) mapping the information into a structure that can be used by applications (such as mediators).

Comments: Database Research Group. This working paper is available at ScholarlyCommons: http://repository.upenn.edu/db_research/39

Arnaud Sahuguet, Department of Computer and Information Science, University of Pennsylvania
Fabien Azavant, École Nationale Supérieure des Télécommunications, Paris, France

1 A need for Web wrappers

The Web has become a major conduit to information repositories of all kinds. Today, more than 80% of information published on the Web is generated by underlying databases and this proportion keeps increasing. In some cases, database access is only granted through a Web gateway using forms as a query language and HTML as a display vehicle. In order to permit inter-operation (between Web sources and legacy databases or among Web sources themselves) there is a strong need for Web wrappers.
Web wrappers share some of the characteristics of standard database wrappers, but the underlying data sources usually offer very limited query capabilities, and the structure of the result (due to HTML shortcomings) might be loose and unstable. To overcome these problems, we divide the architecture of our Web wrappers into three components: (1) fetching the document, (2) extracting the information from its HTML formatting, and (3) mapping the information into a structure that can be used by applications (such as mediators).

W4F is a toolkit that allows the fast generation of Web wrappers. Given a Web source, some extraction rules and some structural mappings, the toolkit generates a Web wrapper (a Java class) that can be used as a stand-alone program or integrated into a more complex system. W4F provides a rich language (HEL: HTML Extraction Language) to declaratively express extraction rules and mappings, as well as a WYSIWYG interface that lets the creator of the wrapper pick relevant pieces of information just by clicking on them as they appear in the Web browser.

As an illustration, we present the TV-Guide Agent that allows users to query TV movie listings by time scheduled (date, time, channel) and program content (movie genre, rating, year, cast, country, etc.). This example demonstrates real inter-operability between TV-listing information (http://tv.yahoo.com) and movie information (Internet Movie Database).

2 The architecture

The architecture of our wrapper "factory" identifies three separate components: retrieval, extraction and mapping. This structure is motivated both by the particularities of Web data sources and by the desire to take advantage of re-usable functionalities. For example, wrappers for Web sources that use the same query form or that feed into the same application could reuse the same components. As presented in Figure 1, an HTML document is first retrieved from the Web according to one or more retrieval rules.
Currently, a retrieval rule simply consists of the URL of the remote document. Once retrieved, the document is fed to an HTML parser that constructs a corresponding parse tree. Given the permissiveness of HTML, the parser has to recover from badly-formed documents. Extraction rules are then applied on the parse tree and the extracted information is stored in an internal format based on nested string lists (NSL), the datatype defined by NSL = null + string + listof(NSL). Finally, NSL structures are mapped to structures exported by the wrapper to the upper-level application, according to mapping rules.

[Figure 1; surviving labels: HTML tree, HTML document]
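
The retrieval, extraction and mapping stages described above can be sketched in a few lines. The following is a minimal Python illustration under stated assumptions, not W4F itself: W4F generates Java classes and expresses extraction rules in HEL, neither of which is reproduced here. The names Listing, retrieve, parse, extract, map_listings and wrap, the example.org URL, and the sample listings are all hypothetical. Retrieval is simulated with an in-memory page so that the sketch runs offline, and the second table row deliberately omits its closing td tags to exercise recovery from badly-formed HTML.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Union

# NSL = null + string + listof(NSL), following the definition in the text.
NSL = Union[None, str, List["NSL"]]

# Stand-in for the Web (assumption): one invented page, second row malformed.
PAGES = {
    "http://example.org/listings": (
        "<table>"
        "<tr><td>20:00</td><td>Casablanca</td></tr>"
        "<tr><td>22:00<td>Vertigo</tr>"
        "</table>"
    )
}


def is_nsl(value: object) -> bool:
    """True if the value is built only from None, strings and lists."""
    if value is None or isinstance(value, str):
        return True
    return isinstance(value, list) and all(is_nsl(item) for item in value)


class Node:
    """One element of the parse tree."""

    def __init__(self, tag: str, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.parent = parent
        self.text: List[str] = []
        self.children: List["Node"] = []


class TreeBuilder(HTMLParser):
    """Builds a parse tree and tolerates unclosed or stray tags."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open tag; ignore unmatched end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def retrieve(url: str) -> str:
    """Retrieval: the rule is just a URL (looked up in PAGES, not fetched)."""
    return PAGES[url]


def parse(document: str) -> Node:
    builder = TreeBuilder()
    builder.feed(document)
    builder.close()
    return builder.root


def find_all(node: Node, tag: str) -> List[Node]:
    """All descendants with the given tag, in document order."""
    found = []
    for child in node.children:
        if child.tag == tag:
            found.append(child)
        found.extend(find_all(child, tag))
    return found


def extract(root: Node) -> NSL:
    """Extraction: one list of cell strings per table row."""
    return [[" ".join(cell.text) for cell in find_all(row, "td")]
            for row in find_all(root, "tr")]


@dataclass
class Listing:
    time: str
    title: str


def map_listings(nsl: NSL) -> List[Listing]:
    """Mapping: turn the nested string lists into application objects."""
    return [Listing(time=row[0], title=row[1]) for row in nsl]


def wrap(url: str) -> List[Listing]:
    return map_listings(extract(parse(retrieve(url))))


if __name__ == "__main__":
    for listing in wrap("http://example.org/listings"):
        print(listing)
```

Keeping the three stages as separate functions mirrors the reuse argument in the text: a different source with the same row layout would only need a different retrieve, and a different application would only need a different map_listings.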

Similar resources

Wrapper Maintenance: A Machine Learning Approach

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wr...

WysiWyg Web Wrapper Factory (W4F)

In this paper, we present the W4F toolkit for the generation of wrappers for Web sources. W4F consists of a retrieval language to identify Web sources, a declarative extraction language (the HTML Extraction Language) to express robust extraction rules and a mapping interface to export the extracted information into some user-defined data-structures. To assist the user and make the creation of wra...

NIE: An Approach for Extracting Information from Narrative Web Information Sources

The World Wide Web (WWW) has become an indispensable repository of valuable information on a wide range of different subjects. However, many web information sources (WISs) present their information in a semi-structured format, which makes it difficult to extract and manipulate the information directly. Consequently, there have been many attempts to develop some approaches that can automatically gene...

Semi-Automatic Wrapper Generation for Internet Information Sources

To simplify the task of obtaining information from the vast number of information sources that are available on the World Wide Web (WWW), we are building tools to build information mediators for extracting and integrating data from multiple Web sources. In a mediator based approach, wrappers are built around individual information sources, that provide translation between the mediator query lan...

A Semantic Caching Scheme for Wrappers in Web Databases

We present a new semantic caching scheme suitable for wrappers in web databases. Since the web sources in web databases have typically weaker querying capabilities than conventional databases, it is not trivial to apply existing semantic caching schemes directly. We provide a seamlessly integrated query translation and capability mapping between the wrappers and web sources in the semantic cach...

Publication date: 1999